The goal of this report is to explore a dataset containing 1599 red wine samples and to highlight aspects of exploratory data analysis as part of the Udacity Data Analyst course. The dataset used in this report is publicly available for research through:
[@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
Our dataset consists of 13 variables with 1599 observations. I excluded the variable X from the analysis, as it is simply a row index, which leaves 12 variables.
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Number of missing values in the dataset:
## [1] 0
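As a rough sketch, the first checks above could be reproduced along these lines. The report does not give the name of the data file, so the tiny data frame `wine_raw` below is only a stand-in built from the first five rows shown in the `str()` output, and the object names are my own assumptions.

```r
# Stand-in for the wine data (first five rows of three columns from the
# str() output above); the real file name is not given in the report.
wine_raw <- data.frame(
  X = 1:5,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

# Drop the row-index column X before the analysis.
wine <- subset(wine_raw, select = -X)

dim(wine)         # number of rows and columns
str(wine)         # variable types
summary(wine)     # quartiles, mean, min and max per variable
sum(is.na(wine))  # total number of missing values
```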
I decided to explore the distribution of quality ratings first. The histogram reveals that most red wines in the dataset are of medium quality (mean 5.64, median 6) on a scale between 0 (very bad) and 10 (very excellent). The worst red wine has a quality of 3, whereas the best red wine in the dataset has a quality score of 8. The distribution is slightly skewed to the left, suggesting that the dataset contains more wines of medium to higher quality than of lower quality.
The question we want to ask is which of the following chemical properties can help us explain the quality ratings of red wines, which are based on sensory data from wine experts. Let's try to find out!
All the chemical properties describing different acids in the 3 histograms above seem to be right skewed in various forms. A closer look, however, reveals that the distributions of fixed and volatile acidity are most likely skewed because of some outliers. My idea was to trim both variables' distributions in order to make them approximately normal for further analysis.
Moreover, I am attempting to transform the citric acid variable using a square root transformation in order to get a clearer view of its distribution. The distribution looks like a waveform, peaking and declining at various points of the scale. Zooming in might help to understand this pattern a little better.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
A closer look at the histograms above illustrates how the distributions of fixed and volatile acidity look closer to normal once the x-axis is limited.
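The trimming and the square root transformation can be sketched as follows. This is only an illustration under assumptions: the skewed sample is synthetic, the 99th percentile is my guess at a reasonable cutoff for the acidity variables, and the variable names are hypothetical.

```r
set.seed(1)

# Synthetic right-skewed values standing in for fixed.acidity (an assumption).
acidity <- rlnorm(500, meanlog = 2, sdlog = 0.2)

# Keep values at or below the 99th percentile to cut off the long right tail.
cutoff <- quantile(acidity, 0.99, names = FALSE)
trimmed <- acidity[acidity <= cutoff]

summary(acidity)
summary(trimmed)

# Square root transformation as tried for citric.acid. The variable contains
# zeros, so sqrt() is safe where log10() would return -Inf.
citric <- c(0, 0, 0.04, 0.56, 0.36)
citric_sqrt <- sqrt(citric)
citric_sqrt
```

Note that trimming only changes what is displayed or analysed in that step; the raw values stay untouched in `acidity`.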
Citric acid seems to be different. We can see that a significant share of red wines has very little or no citric acid. About 8% of red wines contain no citric acid, which makes adding ‘freshness’ and flavor to wines using citric acid look rather optional [Cortez et al., 2009]. On the other hand, we can also confirm the bimodal wave pattern from the first histogram and see that the counts peak around 0.25 and 0.50. Perhaps adding citric acid is a rather delicate process, and mastering this technique adds to the overall quality score? Hopefully a correlation analysis will clarify these observations further.
Looking at sugars and salts in wine, we can also observe right-skewed distributions. Moreover, outliers are stretching both distributions to more than double the range where the majority of the data sits. I am attempting to transform both variables using a log10 transformation in order to get a clearer view of the distributions, and I am limiting the histograms by ignoring the top 1% of values on the very right of the scale.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The transformations applied helped to understand both distributions better. While sugar still seems to be slightly skewed to the right, we can observe that chlorides peak very sharply at a particular point on the scale. It seems that there is far more variation in the amount of sugar and less so in salts. At this point I am not yet very sure about my intuition on these variables. However, I feel there might be a relationship between the sugar-to-salt ratio and the quality rating.
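To illustrate why a log10 transformation helps with right-skewed variables, here is a small sketch on synthetic data. The sample is generated, not taken from the wine dataset, and the helper `skewness()` is a simple moment-based measure I define here for illustration.

```r
set.seed(42)

# Synthetic right-skewed sample standing in for residual.sugar (an assumption);
# the median of 2.2 mirrors the summary output above.
sugar <- rlnorm(1000, meanlog = log(2.2), sdlog = 0.4)

# Moment-based skewness: positive values indicate a longer right tail.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# log10 requires strictly positive values, which holds for this sample.
sugar_log10 <- log10(sugar)

skewness(sugar)        # raw values: clearly positive
skewness(sugar_log10)  # transformed values: much closer to zero
```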
As with many of the other chemical properties in the dataset, free.sulfur.dioxide and total.sulfur.dioxide are right skewed. The presence of sulfur dioxide in wine is important as it prevents microbial growth and oxidation. However, concentrations over 50 ppm become evident in the nose and taste and are therefore most likely less desirable. Perhaps this can explain some of the variance in the lower quality ranks. My intuition at this point is that the distributions are right skewed because of this.
I am also attempting to transform both variables using a log10 transformation in order to get a clearer view of the distributions.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Interestingly, the log10 transformation helped significantly to bring the variables closer to a normal distribution. While total.sulfur.dioxide shows a nearly bell-shaped pattern in the histogram, free.sulfur.dioxide is distributed in a more scattered way. There are some outliers that contain very little free.sulfur.dioxide. On further evaluation of the same variable, we see that its presence throughout the distribution is rather irregular. However, due to the nature of the variable itself, I have very limited intuition about its significance in relation to quality other than what I have already noted above.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Density is almost perfectly bell-shaped in its distribution. Its value depends on the percent of alcohol and sugar content in the red wine. I have doubts that density can explain a lot of the variance in quality, but we will see how it performs in a correlation analysis.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Very much like density, pH is relatively normal in its distribution. From the text file, which describes the variables and how the data was collected, I can gather that most wines are between 3 and 4 on the pH scale, which spans from 0 (very acidic) to 14 (very basic). The data confirms this: the observed values range from 2.74 to 4.01, so all wines in the dataset are clearly acidic.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates are additives which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. The first histogram shows a right-skewed distribution of the data as a whole. Some outliers stretch the distribution far beyond where the majority of the data sits (the third quartile is 0.73, while the maximum is 2.00).
I first tried limiting the histogram to 99% of the data. The distribution still appeared skewed to the right, so I decided to apply a further log10 transformation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol, I would assume, is one of the most important variables in wine. Every wine in the dataset contains at least 8.4% alcohol, and the maximum is 14.9%. The distribution is right skewed, peaking at about 9.5%, with the tail pulling the overall mean up to 10.42.
Unfortunately, I was not able to find an applicable transformation that could bring the distribution to a more suitable shape for analysis. Limiting the data to 99%, and thereby excluding some of the outliers, could fix some of the skewness. However, the peak at 9.5% remains the main explanation for its shape in the histogram.
There are 1599 red wines included in the dataset with 12 features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”). The variable quality is an integer, whereas all other variables are numeric.
Other observations:
Higher number of wines from medium to higher quality.
About 8% of red wines contain no citric acid.
The median for residual.sugar is 2.200 and the max is 15.500.
The main feature in the dataset is quality. My intuition leads me to assume that alcohol could be another main feature. However, I lack the necessary domain knowledge on chemical properties and wine to identify other main features from the analysis above without further statistical investigation.
I have a feeling that the ratio between the presence of one chemical property and the absence of another could possibly explain some of the variance in quality.
I lack the necessary domain knowledge on chemical properties and wine to identify appropriate opportunities in the dataset that would make such an operation useful.
I have transformed quality to a factor as it is a categorical variable on a scale between 0 (very bad) and 10 (very excellent). Moreover, I have performed transformations on the following variables, as many of them were right skewed: citric.acid.sqr, residual.sugar.log10, chlorides.log10, free.sulfur.dioxide.log10, total.sulfur.dioxide.log10, sulphates.log10. The transformations were successful except for citric.acid.sqr. This variable still shows a bimodal wave pattern with counts peaking around 0.25 and 0.50.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00 -0.26 0.67
## volatile.acidity -0.26 1.00 -0.55
## citric.acid 0.67 -0.55 1.00
## residual.sugar 0.11 0.00 0.14
## chlorides 0.09 0.06 0.20
## free.sulfur.dioxide -0.15 -0.01 -0.06
## total.sulfur.dioxide -0.11 0.08 0.04
## density 0.67 0.02 0.36
## pH -0.68 0.23 -0.54
## sulphates 0.18 -0.26 0.31
## alcohol -0.06 -0.20 0.11
## quality 0.12 -0.39 0.23
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.11 0.09 -0.15
## volatile.acidity 0.00 0.06 -0.01
## citric.acid 0.14 0.20 -0.06
## residual.sugar 1.00 0.06 0.19
## chlorides 0.06 1.00 0.01
## free.sulfur.dioxide 0.19 0.01 1.00
## total.sulfur.dioxide 0.20 0.05 0.67
## density 0.36 0.20 -0.02
## pH -0.09 -0.27 0.07
## sulphates 0.01 0.37 0.05
## alcohol 0.04 -0.22 -0.07
## quality 0.01 -0.13 -0.05
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.11 0.67 -0.68 0.18 -0.06
## volatile.acidity 0.08 0.02 0.23 -0.26 -0.20
## citric.acid 0.04 0.36 -0.54 0.31 0.11
## residual.sugar 0.20 0.36 -0.09 0.01 0.04
## chlorides 0.05 0.20 -0.27 0.37 -0.22
## free.sulfur.dioxide 0.67 -0.02 0.07 0.05 -0.07
## total.sulfur.dioxide 1.00 0.07 -0.07 0.04 -0.21
## density 0.07 1.00 -0.34 0.15 -0.50
## pH -0.07 -0.34 1.00 -0.20 0.21
## sulphates 0.04 0.15 -0.20 1.00 0.09
## alcohol -0.21 -0.50 0.21 0.09 1.00
## quality -0.19 -0.17 -0.06 0.25 0.48
## quality
## fixed.acidity 0.12
## volatile.acidity -0.39
## citric.acid 0.23
## residual.sugar 0.01
## chlorides -0.13
## free.sulfur.dioxide -0.05
## total.sulfur.dioxide -0.19
## density -0.17
## pH -0.06
## sulphates 0.25
## alcohol 0.48
## quality 1.00
In order to get a better visual representation of the correlations, a heatmap of correlations is shown below. Warm colors indicate negative correlations whereas cold colors indicate positive correlations.
At this point I was also wondering how the transformed variables would perform with each other and whether there are any significant improvements. The pairs.panels plot below is intended to show just that.
The heatmap above provides some interesting insights into the nature of the correlations without including any of the transformed variables. Moreover, the pairs.panels function from the psych package shows the relationships of the transformed variables.
Generally, the bivariate correlations are rather weak, and the transformed variables do not necessarily perform better. The correlation coefficients improve slightly, but not in a meaningful way that would justify further investigation.
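A correlation matrix like the one printed earlier can be computed with base R's `cor()`. The sketch below uses synthetic data with a built-in negative link between alcohol and density; the data, the coefficients used to generate it and the object names are assumptions for illustration only.

```r
set.seed(7)
n <- 200

# Synthetic variables (assumptions): density falls as alcohol rises,
# and quality rises loosely with alcohol.
alcohol <- rnorm(n, mean = 10.4, sd = 1)
density <- 1.0 - 0.001 * alcohol + rnorm(n, sd = 0.001)
quality <- round(5.6 + 0.4 * (alcohol - 10.4) + rnorm(n, sd = 0.7))
wine <- data.frame(alcohol, density, quality)

# Pearson correlations, rounded to two decimals like the matrix in the report.
cor_matrix <- round(cor(wine), 2)
cor_matrix
```

Pearson's coefficient only captures linear association, which is one reason a monotone transformation such as log10 can shift the coefficients slightly without changing the overall picture.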
The variables that correlate most strongly with quality are alcohol (0.48) and volatile.acidity (-0.39), although both relationships are only weak to moderate. The direction of volatile.acidity with quality is negative, meaning that wines with lower levels of volatile.acidity tend to have better quality, which is shown in the boxplot below.
## dataset$quality.factor: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## dataset$quality.factor: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## dataset$quality.factor: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## dataset$quality.factor: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## dataset$quality.factor: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## dataset$quality.factor: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
On the other hand, alcohol has a positive relationship with quality, meaning, to put it cautiously, that higher alcohol levels tend to go along with higher quality in wine, as shown in the boxplot below. Interestingly, medium quality wine has the biggest range in alcohol. I wonder if the combination with volatile.acidity can reveal more insights?
## dataset$quality.factor: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## dataset$quality.factor: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## dataset$quality.factor: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## dataset$quality.factor: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## dataset$quality.factor: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## dataset$quality.factor: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
## dataset$quality.factor: 3
## [1] 8.4 11.0
## --------------------------------------------------------
## dataset$quality.factor: 4
## [1] 9.0 13.1
## --------------------------------------------------------
## dataset$quality.factor: 5
## [1] 8.5 14.9
## --------------------------------------------------------
## dataset$quality.factor: 6
## [1] 8.4 14.0
## --------------------------------------------------------
## dataset$quality.factor: 7
## [1] 9.2 14.0
## --------------------------------------------------------
## dataset$quality.factor: 8
## [1] 9.8 14.0
## # A tibble: 6 x 4
## quality.factor count mean sd
## <fct> <int> <dbl> <dbl>
## 1 3 10 9.96 0.818
## 2 4 53 10.3 0.935
## 3 5 681 9.90 0.737
## 4 6 638 10.6 1.05
## 5 7 199 11.5 0.962
## 6 8 18 12.1 1.22
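The per-group summaries above look like the output of `by()` combined with a grouped count/mean/sd table. A minimal base R sketch of the same idea follows; the six data points are made up for illustration, and the report itself appears to use a tibble for the last table.

```r
# Small illustrative sample; the values are made up for this sketch.
wine <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.8, 10.5, 11.0, 11.5, 12.1)
)
wine$quality.factor <- factor(wine$quality)

# Summary of alcohol within each quality level.
by(wine$alcohol, wine$quality.factor, summary)

# Count, mean and standard deviation of alcohol per quality level.
group_stats <- data.frame(
  count = as.vector(table(wine$quality.factor)),
  mean  = as.vector(tapply(wine$alcohol, wine$quality.factor, mean)),
  sd    = as.vector(tapply(wine$alcohol, wine$quality.factor, sd)),
  row.names = levels(wine$quality.factor)
)
group_stats
```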
Looking at the relationships between the supporting variables, I can see that total.sulfur.dioxide and free.sulfur.dioxide correlate moderately to strongly with each other (0.67). According to the text file which describes the variables, this makes a lot of sense, as free.sulfur.dioxide is part of the measurement of total.sulfur.dioxide. I decided not to investigate this relationship further.
We can also see a moderate negative correlation between density and alcohol. This makes sense as well, since the density of wine is close to that of water and depends on the percent alcohol and sugar content.
The first scatterplot above shows the moderate negative relationship between density and alcohol of -0.50. That means as one variable increases, the other tends to decrease.
On the other hand, density and fixed.acidity have a moderate to strong positive relationship of 0.67. That means fixed.acidity tends to increase with density.
Both scatterplots above explore the relationships between citric.acid and pH (-0.54) as well as fixed.acidity and pH (-0.68) found in the correlation matrix. The negative relationships can be seen in the dotted scatterplot clouds. Again, I wonder how some of these variables related to acidity, pH, density and alcohol might support each other with regard to quality during the bivariate analysis.
I will focus on exploring the statistical relationship of alcohol and acidity with quality as a factor. I am showing an initial one-way ANOVA test below, including plots checking homogeneity of variance and normality. I am going to start building a model with alcohol ~ factor(quality):
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(quality) 5 483.9 96.79 115.9 <2e-16 ***
## Residuals 1593 1330.8 0.84
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Points 653 and 145 are flagged as outliers. I want to have a look at a similar model with volatile.acidity ~ factor(quality) before I optimize the model further.
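For reference, a one-way ANOVA of this form can be fitted with `aov()`. The sketch below runs on synthetic data with three quality groups whose means loosely follow the group means reported earlier; the data and object names are assumptions.

```r
set.seed(3)

# Synthetic data (an assumption): three quality groups with rising mean alcohol.
quality <- rep(c(5, 6, 7), each = 40)
group_means <- rep(c(9.9, 10.6, 11.5), each = 40)
alcohol <- rnorm(120, mean = group_means, sd = 0.9)
wine <- data.frame(quality, alcohol)

# One-way ANOVA with quality treated as a factor.
fit <- aov(alcohol ~ factor(quality), data = wine)
summary(fit)

# The residuals are what the normality and variance checks operate on.
aov_residuals <- residuals(fit)

# plot(fit, 1) and plot(fit, 2) would draw the residuals-vs-fitted and
# normal Q-Q diagnostics, which label the most extreme points.
```

Wrapping quality in `factor()` matters: with a numeric quality the model has a single degree of freedom and tests for a linear trend rather than for differences between groups.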
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(quality) 5 8.22 1.645 60.91 <2e-16 ***
## Residuals 1593 43.01 0.027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Points 128, 127 and 1300 are flagged as outliers. I will remove them along with the outliers from alcohol ~ factor(quality) and re-run the calculation in order to minimize effects on normality and homogeneity of variance.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 1 378.2 378.2 480.1 <2e-16 ***
## Residuals 1561 1229.6 0.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After subsetting to about 99% of the distribution in alcohol I could successfully remove all the relevant outliers.
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 1 5.79 5.786 239.3 <2e-16 ***
## Residuals 1561 37.75 0.024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After subsetting to about 99% of the distribution in volatile.acidity I could successfully remove all the relevant outliers.
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 5 23.013 < 2.2e-16 ***
## 1557
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Shapiro-Wilk normality test
##
## data: aov_residuals.alcohol
## W = 0.97013, p-value < 2.2e-16
After subsetting to about 99% of the distribution in alcohol and volatile.acidity I could successfully remove all the relevant outliers. However, Levene’s test for alcohol returns a p-value far below the significance level of 0.05 (F = 23.01, p < 2.2e-16). This means that the variance of alcohol differs significantly across quality groups, so homogeneity of variance is violated. Moreover, the Shapiro-Wilk test on the ANOVA residuals (W = 0.97, p < 0.05) shows that, even after removing the outliers, normality is violated.
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 5 1.5539 0.1701
## 1557
##
## Shapiro-Wilk normality test
##
## data: aov_residuals.volatile.acidity
## W = 0.99219, p-value = 2.287e-07
I also ran Levene’s test for volatile.acidity. Here the p-value is not less than the significance level of 0.05 (F = 1.55, p = 0.17), so there is no evidence that the variance differs across quality groups. However, the Shapiro-Wilk test on the ANOVA residuals (W = 0.99, p < 0.05) shows that normality is violated here as well, so the normality assumption holds for neither alcohol nor volatile.acidity:
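The two assumption checks can be sketched in base R. The output header in the report looks like it comes from `leveneTest()` in the car package; to stay free of extra packages, the sketch below instead computes Levene's statistic by hand as a one-way ANOVA on absolute deviations from the group means (the `center = mean` variant). The data are synthetic and all object names are assumptions.

```r
set.seed(11)

# Synthetic data (an assumption): three groups with equal spread.
group <- factor(rep(c("5", "6", "7"), each = 50))
y <- rnorm(150, mean = rep(c(0.58, 0.50, 0.40), each = 50), sd = 0.15)

# Levene's test (center = mean): one-way ANOVA on the absolute deviations
# of each value from its own group mean.
abs_dev <- abs(y - ave(y, group))
levene_fit <- anova(lm(abs_dev ~ group))
levene_p <- levene_fit[["Pr(>F)"]][1]

# Shapiro-Wilk test on the residuals of the original one-way ANOVA.
aov_residuals <- residuals(aov(y ~ group))
shapiro <- shapiro.test(aov_residuals)

levene_p         # small values suggest unequal variances across groups
shapiro$p.value  # small values suggest non-normal residuals
```

With samples as large as the wine dataset, the Shapiro-Wilk test flags even mild departures from normality, so a W close to 1 with a tiny p-value is not unusual.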
I am investigating further by using a Kruskal-Wallis rank sum test. At a .05 significance level, this will allow me to see whether alcohol levels in lower quality wines follow the same distribution as alcohol levels in medium to higher quality wines, without assuming the data to be normally distributed.
##
## Kruskal-Wallis rank sum test
##
## data: alcohol by quality
## Kruskal-Wallis chi-squared = 398.93, df = 5, p-value < 2.2e-16
##
## Pairwise comparisons using Wilcoxon rank sum test
##
## data: rm_outliers_subset$alcohol and rm_outliers_subset$quality
##
## 3 4 5 6 7
## 4 0.21614 - - - -
## 5 0.67190 0.06119 - - -
## 6 0.01232 0.00888 < 2e-16 - -
## 7 7.8e-05 2.8e-11 < 2e-16 < 2e-16 -
## 8 0.00134 5.8e-05 4.2e-08 0.00022 0.16378
##
## P value adjustment method: BH
After seeing that the test is significant, I decided to apply the same method to volatile.acidity.
##
## Kruskal-Wallis rank sum test
##
## data: volatile.acidity by quality
## Kruskal-Wallis chi-squared = 218.48, df = 5, p-value < 2.2e-16
##
## Pairwise comparisons using Wilcoxon rank sum test
##
## data: rm_outliers_subset$volatile.acidity and rm_outliers_subset$quality
##
## 3 4 5 6 7
## 4 0.4235 - - - -
## 5 0.0409 0.0080 - - -
## 6 0.0049 5.2e-07 < 2e-16 - -
## 7 0.0004 5.1e-13 < 2e-16 1.7e-13 -
## 8 0.0025 6.8e-05 5.8e-05 0.0099 0.8603
##
## P value adjustment method: BH
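Interestingly, this data reveals that among neighbouring quality levels it is mainly the medium ones that differ from each other. The differences between 5 and 6 and between 6 and 7 are significant for both variables, while the difference between 4 and 5 is significant for volatile.acidity (p = 0.008) but not for alcohol (p = 0.061). Neighbouring levels at the extremes (3 vs 4 and 7 vs 8) do not differ significantly. Moreover, the top quality categories (7 and 8) differ significantly from the bottom categories (3 and 4).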
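The Kruskal-Wallis test and the pairwise follow-up can be sketched with base R's `kruskal.test()` and `pairwise.wilcox.test()`, using the Benjamini-Hochberg ("BH") adjustment named in the output. The data below are synthetic, with groups 4 and 5 sharing a mean on purpose, and the object names are assumptions.

```r
set.seed(5)

# Synthetic data (an assumption): groups 4 and 5 share a mean,
# groups 6 and 7 are shifted upward.
quality <- factor(rep(c("4", "5", "6", "7"), each = 40))
alcohol <- rnorm(160, mean = rep(c(10.0, 10.0, 10.8, 11.6), each = 40), sd = 0.8)

# Overall test: do all groups share the same distribution?
kw <- kruskal.test(alcohol ~ quality)
kw$p.value

# Pairwise Wilcoxon rank sum tests with Benjamini-Hochberg adjustment.
pw <- pairwise.wilcox.test(alcohol, quality, p.adjust.method = "BH")
pw$p.value  # lower-triangular matrix of adjusted p-values
```

Each cell of `pw$p.value` compares the row level with the column level, which is how the tables above should be read.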
Overall, the relationships between the feature of interest and the other features in the dataset turn out to be rather weak. Even the transformation of some of the variables could not really change this impression. Alcohol has the strongest relationship with quality, meaning that higher levels of alcohol moderately correlate with higher quality ratings.
The variable volatile.acidity has the second strongest relationship with quality and among all the other variables in the dataset it is the only one left worth mentioning in the context of quality.
Alcohol and volatile.acidity have the strongest relationship with quality. With significant p-values from both Kruskal-Wallis rank sum tests, I can assume that the distributions of alcohol and volatile.acidity are not the same across all quality categories. Hence I conclude that there is a significant relationship between quality categories and alcohol as well as volatile.acidity.
My first intention for the multivariate section was to color some of the scatterplots from above to see if quality will reveal some additional patterns.
The points are colored by quality.factor.
Moreover, I was trying to explore additional variables with density, alcohol and quality.
I have also highlighted quality.factor in the colors with fixed.acidity and pH as well as citric.acid and pH.
My idea for this part of the assignment was to add the quality factor to the existing plots I created. The coloring with quality.factor indicates that higher quality wines tend to have more alcohol and lower density, whereas higher density and lower levels of alcohol indicate medium to lower quality wines.
It also seems as if quality is slightly layered with fixed.acidity and density. This could mean that higher quality wines tend to have higher amounts of fixed.acidity while having slightly lower density than medium quality wines.
Moreover, I was trying to explore additional variables with density, alcohol and quality that I had originally found to be minimally connected with quality in the correlation table. However, the plots do not really reveal anything new worth following up on. I have also highlighted quality.factor in the colors with fixed.acidity and pH as well as citric.acid and pH. This, however, illustrates that neither of these relationships is tied to quality in any meaningful, visible way.
Unfortunately, I haven’t been able to uncover any new insights in this section that really surprised me. I can imagine that greater familiarity with the chemical properties in the dataset could result in better insights.
I have not created any models beyond the bivariate models because I lack the knowledge of how to choose an appropriate statistical procedure. Further investigation would be needed.
I chose a box plot for my first two plots, as I believe it is the best way to visualize the most important data in this assignment. The box plot shows the distributions of alcohol within different quality groups, along with the median, range and outliers. The width of the boxes is proportional to the number of observations each contains. We can see that most of the wines fall into categories 5 and 6. These two categories also have the widest ranges, which in category 5 (8.5 to 14.9) is driven mainly by outliers. Overall, however, we can see a clear trend underpinning the statistical results that there is a significant difference in the distribution of alcohol among some of the categories.
I also chose a box plot for my second plot in order to highlight the second most important variable explaining some of the variance in quality. Just like the first plot, it shows the distributions of the variable within different quality groups, along with the median, range and outliers. We can see that most of the wines fall into categories 5 and 6 and that these two categories have a wide spread. Overall the trend is clearer than in the first plot, as the outliers are a little more evenly distributed. The plot also underpins the statistical results from the Kruskal-Wallis test, showing that there is a significant difference in the distribution of volatile.acidity among some of the categories.
The third and last plot summarizes how I would want to further explore the model explaining the variance in quality. It is an attempt to combine the most important variables in the exploratory data analysis. Alcohol and quality have the strongest relationship in the analysis so far. My thought was to use this relationship and expand it with density, as density correlates well with alcohol. The less dense the wine is, the more alcohol it seems to contain, as the negative relationship suggests. I colored this relationship by the quality factor to see if this can clarify things further. It looks as if lower quality wines tend to have less alcohol, which in turn results in higher density. The size of the points in the scatterplot is determined by the level of volatile.acidity, as I was hoping it would add to the overall significance of the plot. Unfortunately, I don’t find the sizing by volatile.acidity to be that informative.
It was a very challenging assignment for me, but I learned a lot of new things. I am especially happy about having worked so much with ggplot at this stage. I feel super comfortable using it and resolving issues I haven’t yet figured out about the tool. From an analysis perspective, I had thought that I might be able to find similarly significant results as in the diamond dataset. While I am happy with some of the findings I have presented, I feel that the report fails to deliver any surprising or outstanding results. The strongest result I was able to show within the scope of the exploratory data analysis was the influence alcohol has on quality. After all, it seems that the fun factor about wine is the biggest predictor of a good wine. Probably very much to the dismay of sommeliers. On the other hand, it became clear that the right amount of acidity is an important factor for quality as well. Too much of it tends to result in lower quality scores. With alcohol and volatile acidity being the main contributors to quality, I could imagine diving deeper in an advanced analysis. Perhaps there could be a way to merge the variables describing acidity in order to understand their effect on quality better.
I was also quite busy with my day job, so I’m really happy that I was actually able to submit this assignment, even though it was way past the due date. One aspect I have mentioned a lot is domain knowledge. With more time available I would try to study the variables in more detail in order to fine-tune the analysis and derive more insights.
References:
R for Data Science Book by Garrett Grolemund and Hadley Wickham https://www.r-bloggers.com/
http://www.sthda.com/english/wiki/one-way-anova-test-in-r